{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 案例：波士顿房价预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn import linear_model\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.metrics import mean_absolute_error"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 1、获取数据\n",
    "### 1.1 波士顿房价数据在sklearn中已经内置，可以通过load_boston()方法获得"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "outputs": [],
   "source": [
    "boston = load_boston()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### 特征含义\n",
    "CRIM：城镇人均犯罪率。<br/>\n",
    "ZN：住宅用地超过 25000 sq.ft. 的比例。<br/>\n",
    "INDUS：城镇非零售商用土地的比例。<br/>\n",
    "CHAS：查理斯河空变量（如果边界是河流，则为1；否则为0）。<br/>\n",
    "NOX：一氧化氮浓度。<br/>\n",
    "RM：住宅平均房间数。<br/>\n",
    "AGE：1940 年之前建成的自用房屋比例。<br/>\n",
    "DIS：到波士顿五个中心区域的加权距离。<br/>\n",
    "RAD：辐射性公路的接近指数。<br/>\n",
    "TAX：每 10000 美元的全值财产税率。<br/>\n",
    "PTRATIO：城镇师生比例。<br/>\n",
    "B：1000（Bk-0.63）^ 2，其中 Bk 指代城镇中黑人的比例。<br/>\n",
    "LSTAT：人口中地位低下者的比例。<br/>\n",
    "MEDV：自住房的平均房价，以千美元计。<br/>"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".. _boston_dataset:\n",
      "\n",
      "Boston house prices dataset\n",
      "---------------------------\n",
      "\n",
      "**Data Set Characteristics:**  \n",
      "\n",
      "    :Number of Instances: 506 \n",
      "\n",
      "    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n",
      "\n",
      "    :Attribute Information (in order):\n",
      "        - CRIM     per capita crime rate by town\n",
      "        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n",
      "        - INDUS    proportion of non-retail business acres per town\n",
      "        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
      "        - NOX      nitric oxides concentration (parts per 10 million)\n",
      "        - RM       average number of rooms per dwelling\n",
      "        - AGE      proportion of owner-occupied units built prior to 1940\n",
      "        - DIS      weighted distances to five Boston employment centres\n",
      "        - RAD      index of accessibility to radial highways\n",
      "        - TAX      full-value property-tax rate per $10,000\n",
      "        - PTRATIO  pupil-teacher ratio by town\n",
      "        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town\n",
      "        - LSTAT    % lower status of the population\n",
      "        - MEDV     Median value of owner-occupied homes in $1000's\n",
      "\n",
      "    :Missing Attribute Values: None\n",
      "\n",
      "    :Creator: Harrison, D. and Rubinfeld, D.L.\n",
      "\n",
      "This is a copy of UCI ML housing dataset.\n",
      "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n",
      "\n",
      "\n",
      "This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n",
      "\n",
      "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
      "prices and the demand for clean air', J. Environ. Economics & Management,\n",
      "vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
      "...', Wiley, 1980.   N.B. Various transformations are used in the table on\n",
      "pages 244-261 of the latter.\n",
      "\n",
      "The Boston house-price data has been used in many machine learning papers that address regression\n",
      "problems.   \n",
      "     \n",
      ".. topic:: References\n",
      "\n",
      "   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
      "   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(boston.DESCR)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,\n",
      "        4.9800e+00],\n",
      "       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,\n",
      "        9.1400e+00],\n",
      "       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,\n",
      "        4.0300e+00],\n",
      "       ...,\n",
      "       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n",
      "        5.6400e+00],\n",
      "       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,\n",
      "        6.4800e+00],\n",
      "       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,\n",
      "        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,\n",
      "       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,\n",
      "       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,\n",
      "       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,\n",
      "       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,\n",
      "       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,\n",
      "       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,\n",
      "       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,\n",
      "       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,\n",
      "       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,\n",
      "       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,\n",
      "       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,\n",
      "       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,\n",
      "       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,\n",
      "       17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,\n",
      "       25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,\n",
      "       23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,\n",
      "       32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,\n",
      "       34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,\n",
      "       20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,\n",
      "       26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,\n",
      "       31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,\n",
      "       22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,\n",
      "       42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,\n",
      "       36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,\n",
      "       32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,\n",
      "       20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,\n",
      "       20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,\n",
      "       22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,\n",
      "       21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,\n",
      "       19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,\n",
      "       32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,\n",
      "       18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,\n",
      "       16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,\n",
      "       13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3,  8.8,\n",
      "        7.2, 10.5,  7.4, 10.2, 11.5, 15.1, 23.2,  9.7, 13.8, 12.7, 13.1,\n",
      "       12.5,  8.5,  5. ,  6.3,  5.6,  7.2, 12.1,  8.3,  8.5,  5. , 11.9,\n",
      "       27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3,  7. ,  7.2,  7.5, 10.4,\n",
      "        8.8,  8.4, 16.7, 14.2, 20.8, 13.4, 11.7,  8.3, 10.2, 10.9, 11. ,\n",
      "        9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4,  9.6,  8.7,  8.4, 12.8,\n",
      "       10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,\n",
      "       15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,\n",
      "       19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,\n",
      "       29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,\n",
      "       20.6, 21.2, 19.1, 20.6, 15.2,  7. ,  8.1, 13.6, 20.1, 21.8, 24.5,\n",
      "       23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9]), 'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',\n",
      "       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'), 'DESCR': \".. _boston_dataset:\\n\\nBoston house prices dataset\\n---------------------------\\n\\n**Data Set Characteristics:**  \\n\\n    :Number of Instances: 506 \\n\\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\\n\\n    :Attribute Information (in order):\\n        - CRIM     per capita crime rate by town\\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\\n        - INDUS    proportion of non-retail business acres per town\\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\\n        - NOX      nitric oxides concentration (parts per 10 million)\\n        - RM       average number of rooms per dwelling\\n        - AGE      proportion of owner-occupied units built prior to 1940\\n        - DIS      weighted distances to five Boston employment centres\\n        - RAD      index of accessibility to radial highways\\n        - TAX      full-value property-tax rate per $10,000\\n        - PTRATIO  pupil-teacher ratio by town\\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town\\n        - LSTAT    % lower status of the population\\n        - MEDV     Median value of owner-occupied homes in $1000's\\n\\n    :Missing Attribute Values: None\\n\\n    :Creator: Harrison, D. and Rubinfeld, D.L.\\n\\nThis is a copy of UCI ML housing dataset.\\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\\n\\n\\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\\n\\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\\nprices and the demand for clean air', J. Environ. Economics & Management,\\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\\npages 244-261 of the latter.\\n\\nThe Boston house-price data has been used in many machine learning papers that address regression\\nproblems.   \\n     \\n.. topic:: References\\n\\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\\n\", 'filename': 'D:\\\\dev\\\\Anaconda3\\\\lib\\\\site-packages\\\\sklearn\\\\datasets\\\\data\\\\boston_house_prices.csv'}\n"
     ]
    }
   ],
   "source": [
    "print(boston)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### 取特征X和标签y"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "outputs": [],
   "source": [
    "X = boston.data\n",
    "y = boston.target"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 1.2 从文件读取\n",
    "绝大多数情况，数据是存在文件中的，如excel。所以我们也可以从文件中读取数据。一般使用pandas读取。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "outputs": [],
   "source": [
    "import pandas as pd"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "outputs": [
    {
     "data": {
      "text/plain": "     Unnamed: 0     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  \\\n0             0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1   \n1             1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2   \n2             2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2   \n3             3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3   \n4             4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3   \n..          ...      ...   ...    ...   ...    ...    ...   ...     ...  ...   \n501         501  0.06263   0.0  11.93     0  0.573  6.593  69.1  2.4786    1   \n502         502  0.04527   0.0  11.93     0  0.573  6.120  76.7  2.2875    1   \n503         503  0.06076   0.0  11.93     0  0.573  6.976  91.0  2.1675    1   \n504         504  0.10959   0.0  11.93     0  0.573  6.794  89.3  2.3889    1   \n505         505  0.04741   0.0  11.93     0  0.573  6.030  80.8  2.5050    1   \n\n     TAX  PTRATIO       B  LSTAT  price  \n0    296     15.3  396.90   4.98   24.0  \n1    242     17.8  396.90   9.14   21.6  \n2    242     17.8  392.83   4.03   34.7  \n3    222     18.7  394.63   2.94   33.4  \n4    222     18.7  396.90   5.33   36.2  \n..   ...      ...     ...    ...    ...  \n501  273     21.0  391.99   9.67   22.4  \n502  273     21.0  396.90   9.08   20.6  \n503  273     21.0  396.90   5.64   23.9  \n504  273     21.0  393.45   6.48   22.0  \n505  273     21.0  396.90   7.88   11.9  \n\n[506 rows x 15 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>CRIM</th>\n      <th>ZN</th>\n      <th>INDUS</th>\n      <th>CHAS</th>\n      <th>NOX</th>\n      <th>RM</th>\n      <th>AGE</th>\n      <th>DIS</th>\n      <th>RAD</th>\n      <th>TAX</th>\n      <th>PTRATIO</th>\n      <th>B</th>\n      <th>LSTAT</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>0.00632</td>\n      <td>18.0</td>\n      <td>2.31</td>\n      <td>0</td>\n      <td>0.538</td>\n      <td>6.575</td>\n      <td>65.2</td>\n      <td>4.0900</td>\n      <td>1</td>\n      <td>296</td>\n      <td>15.3</td>\n      <td>396.90</td>\n      <td>4.98</td>\n      <td>24.0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>0.02731</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0</td>\n      <td>0.469</td>\n      <td>6.421</td>\n      <td>78.9</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>396.90</td>\n      <td>9.14</td>\n      <td>21.6</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>0.02729</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0</td>\n      <td>0.469</td>\n      <td>7.185</td>\n      <td>61.1</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>392.83</td>\n      <td>4.03</td>\n      <td>34.7</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3</td>\n      <td>0.03237</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0</td>\n      <td>0.458</td>\n      <td>6.998</td>\n      <td>45.8</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>394.63</td>\n      <td>2.94</td>\n      <td>33.4</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>4</td>\n      <td>0.06905</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0</td>\n      <td>0.458</td>\n      <td>7.147</td>\n      <td>54.2</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>396.90</td>\n      <td>5.33</td>\n      <td>36.2</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>501</th>\n      <td>501</td>\n      <td>0.06263</td>\n      <td>0.0</td>\n      <td>11.93</td>\n      <td>0</td>\n      <td>0.573</td>\n      <td>6.593</td>\n      <td>69.1</td>\n      <td>2.4786</td>\n      <td>1</td>\n      <td>273</td>\n      <td>21.0</td>\n      <td>391.99</td>\n      <td>9.67</td>\n      <td>22.4</td>\n    </tr>\n    <tr>\n      <th>502</th>\n      <td>502</td>\n      <td>0.04527</td>\n      <td>0.0</td>\n      <td>11.93</td>\n      <td>0</td>\n      <td>0.573</td>\n      <td>6.120</td>\n      <td>76.7</td>\n      <td>2.2875</td>\n      <td>1</td>\n      <td>273</td>\n      <td>21.0</td>\n      <td>396.90</td>\n      <td>9.08</td>\n      <td>20.6</td>\n    </tr>\n    <tr>\n      <th>503</th>\n      <td>503</td>\n      <td>0.06076</td>\n      <td>0.0</td>\n      <td>11.93</td>\n      <td>0</td>\n      <td>0.573</td>\n      <td>6.976</td>\n      <td>91.0</td>\n      <td>2.1675</td>\n      <td>1</td>\n      <td>273</td>\n      <td>21.0</td>\n      <td>396.90</td>\n      <td>5.64</td>\n      <td>23.9</td>\n    </tr>\n    <tr>\n      <th>504</th>\n      <td>504</td>\n      <td>0.10959</td>\n      <td>0.0</td>\n      <td>11.93</td>\n      <td>0</td>\n      <td>0.573</td>\n      <td>6.794</td>\n      <td>89.3</td>\n      <td>2.3889</td>\n      <td>1</td>\n      <td>273</td>\n      <td>21.0</td>\n      <td>393.45</td>\n      <td>6.48</td>\n      <td>22.0</td>\n    </tr>\n    <tr>\n      <th>505</th>\n      <td>505</td>\n      <td>0.04741</td>\n      <td>0.0</td>\n      <td>11.93</td>\n      <td>0</td>\n      <td>0.573</td>\n      <td>6.030</td>\n      <td>80.8</td>\n      <td>2.5050</td>\n      <td>1</td>\n      <td>273</td>\n      <td>21.0</td>\n      <td>396.90</td>\n      <td>7.88</td>\n      <td>11.9</td>\n    </tr>\n  </tbody>\n</table>\n<p>506 rows × 15 columns</p>\n</div>"
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel('data/boston.xls')\n",
    "df"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "outputs": [],
   "source": [
    "X = df[df.columns[1:-1]]\n",
    "y = df[df.columns[-1]]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 2、数据预处理(数据清洗)\n",
    "\n",
    "\n",
    "我们获取的数据有可能存在下面的一些情况：\n",
    "  - 缺少数据值\n",
    "  - 含有错误的数据值，如年龄=200\n",
    "  - 数据不一致，等级编码有的是“1，2，3”有的却是“A，B，C ”\n",
    "  - 重复的记录值\n",
    "\n",
    "$\\color{red}{注意：本门课程关注的是机器学习算法，\n",
    "而波士顿房价数据也是清理过得，所以该部分不用写代码进行处理}$"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 3、数据分析与可视化\n",
    "\n",
    "$\\color{red}{注意：本门课程关注的是机器学习算法，<br/>\n",
    "不是数据分析，因此忽略数据分析与可视化部分}$"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 4、选择合适的机器学习模型\n",
    "\n",
    "该问题是房价预测问题，线性回归能很好的应用于预测问题，因此我们选择使用线性回归模型"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "outputs": [],
   "source": [
    "model = linear_model.Ridge(alpha=0.1)\n",
    "model.fit(X,y)\n",
    "y_hat = model.predict(X)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "我们如何选择参数alpha呢？"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 5、训练模型(使用交叉验证选择合适的参数)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "outputs": [
    {
     "data": {
      "text/plain": "GridSearchCV(cv=5, estimator=Ridge(),\n             param_grid={'alpha': [0.01, 0.03, 0.05, 0.07, 0.1, 0.5, 0.8, 1],\n                         'normalize': [True, False]},\n             scoring='neg_mean_squared_error')"
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "ridge_model = linear_model.Ridge()\n",
    "param_test={'alpha':[0.01,0.03,0.05,0.07,0.1,0.5,0.8,1],'normalize':[True,False]}\n",
    "gsearch = GridSearchCV(estimator=ridge_model,\n",
    "                       param_grid = param_test, scoring='neg_mean_squared_error', cv=5)\n",
    "gsearch.fit(X_train,y_train)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "outputs": [
    {
     "data": {
      "text/plain": "({'alpha': 0.01, 'normalize': True}, -23.967307928951563)"
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gsearch.best_params_, gsearch.best_score_"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 6、模型评价"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MSE= 28.66653040674006\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "ridge_model = linear_model.Ridge(alpha=0.3,normalize=True)\n",
    "ridge_model.fit(X_train,y_train)\n",
    "y_test_pred = ridge_model.predict(X_test)\n",
    "print('MSE=',mean_squared_error(y_test,y_test_pred))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 7、上线部署使用"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "1、模型保存"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "outputs": [
    {
     "data": {
      "text/plain": "['house_train_model.m']"
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from sklearn.externals import joblib\n",
    "import joblib\n",
    "joblib.dump(ridge_model, \"house_train_model.m\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "2、模型读取"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "outputs": [
    {
     "data": {
      "text/plain": "array([30.54318607, 15.80492533, 15.55575797,  8.70313191, 25.45727193,\n       22.09054164, 20.44398766, 22.36797665, 17.70675485, 17.33832749,\n       21.85942936, 22.32322506, 28.77966266, 24.21137367, 18.61880857,\n       19.12911378, 19.27639879, 14.82650254, 21.12511789, 19.49540693,\n       24.75008889, 18.33657028, 17.93935097, 27.70193856, 24.80218834,\n       24.96595823, 26.44745469, 22.04584747, 29.58252454, 23.71710241,\n       16.86679929, 28.46631139, 21.77800139, 24.99756218, 21.44663265,\n       30.98535165, 25.54054255, 25.33960268, 26.75076779, 14.10384243,\n       32.13129884, 19.08860519, 34.68512938, 25.69893324, 30.66248259,\n       16.76186255, 19.53103247, 15.4746292 , 13.05594993, 25.64250118,\n       14.53059475, 30.05876453, 31.21927663, 17.42760968, 30.43072681,\n       32.33872584, 20.49848541, 23.49751093, 36.8512449 , 19.77613453,\n       15.07883593, 18.36436294, 24.54630644, 33.07061392, 16.78686039,\n       23.92638489, 25.79388268,  8.60703637, 21.17445542, 21.72684622,\n       21.73719336, 25.45172304, 29.50838721, 16.77828296, 17.12730893,\n       21.74102991, 25.34767005, 19.05728897, 24.66245838, 32.38860687,\n       16.27760477, 16.51547058, 22.51455855, 23.91541668, 23.11046839,\n       30.20360244, 27.66223719, 43.44947917, 19.40123994, 17.67894213,\n       22.72047899, 29.24881886, 19.38312308, 35.89924428, 33.07077859,\n       24.53110327, 25.66458789, 30.09588296, 23.79521965, 12.74726728,\n       18.25777279, 32.65357224, 19.16756768, 27.85490145, 22.87651499,\n       20.2965688 , 23.35185352, 29.295646  , 26.1544832 , 18.15389709,\n       13.31383799, 15.9557041 , 21.93500209, 23.01150497, 17.61896888,\n       26.09463971, 20.36082456, 12.17999905, 19.49782475, 17.87380477,\n       31.33473946, 17.91330584, 16.77507361, 31.28572433, 24.1881641 ,\n       24.01670599, 13.39364317, 17.54964638, 24.83096437, 20.30851982,\n       37.68448645, 18.23073971,  8.26611025, 17.89938909, 20.84057254,\n        9.78343187, 22.58634272, 16.96945093, 25.11507582, 23.57932554,\n       23.84014156, 18.78122765, 25.12105315, 35.37581582, 27.57714404,\n       31.70300075, 22.97373842, 26.00065972, 18.94300229, 33.35660647,\n       29.06517579, 16.64578899])"
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = joblib.load(\"house_train_model.m\")\n",
    "clf.predict(X_test)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}